Conversation

@AlexYinHan
Contributor

What is the purpose of the change

This PR fixes FLINK-38336 by allowing the ForSt state backend to reuse restored files in a failover scenario.

Brief change log

  • Group the local and remote paths of ForSt into a ForStPathContainer
  • Determine whether we are in a failover scenario by comparing the DB path with the file paths stored in the StateHandles
  • Use ReusableDataTransferStrategy if we are in a failover scenario
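
As a rough illustration of the check described in the change log, here is a minimal sketch. It is not the code from this PR: the class, the enum, and both method names are hypothetical, java.nio paths stand in for Flink's Path, and falling back to a copy when the paths do not match is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; these are not the actual Flink classes. */
public class FailoverDetectionSketch {

    /** Hypothetical stand-in for the data transfer strategies. */
    public enum Strategy { REUSE, COPY }

    /**
     * Assumption: a restore counts as a failover when every restored file
     * already lives under the DB directory this task is going to use.
     */
    public static boolean isFailover(Path dbPath, List<Path> restoredFiles) {
        if (restoredFiles.isEmpty()) {
            return false; // nothing on disk that could be reused
        }
        Path db = dbPath.normalize();
        for (Path file : restoredFiles) {
            if (!file.normalize().startsWith(db)) {
                return false; // at least one file belongs to another location
            }
        }
        return true;
    }

    /** Reuse files in place on failover; otherwise fall back to copying (assumed). */
    public static Strategy chooseStrategy(Path dbPath, List<Path> restoredFiles) {
        return isFailover(dbPath, restoredFiles) ? Strategy.REUSE : Strategy.COPY;
    }

    public static void main(String[] args) {
        Path db = Paths.get("/remote/job/shared/op/db");
        System.out.println(chooseStrategy(db, Arrays.asList(db.resolve("000001.sst"))));
        System.out.println(chooseStrategy(db, Arrays.asList(Paths.get("/other/job/000001.sst"))));
    }
}
```

The comparison is purely path-based, so it needs no extra flag to signal a failover; the real implementation may apply further conditions, such as the recovery claim mode.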

Verifying this change

This change added tests and can be verified as follows:

  • DataTransferStrategyTest#testBuildingStrategyAsExpected

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@flinkbot
Collaborator

flinkbot commented Sep 25, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@AlexYinHan AlexYinHan force-pushed the yh/dev_38336 branch 2 times, most recently from 9c4c31e to a8460bc Compare September 29, 2025 03:25
.getSharedStateDirectory();
FsCheckpointStorageAccess fsCheckpointStorageAccess =
(FsCheckpointStorageAccess) env.getCheckpointStorageAccess();
remoteJobPath = fsCheckpointStorageAccess.getCheckpointsDirectory();
Contributor

We can only share state from the shared directory, so what exactly does remoteJobPath mean?

Contributor Author

remoteJobPath is the ancestor directory whose name ends with the JobID.

For example:

  • remoteShareWithCheckpoint == true: if the ForSt base directory is /checkpoints/jobid-xxx/shared/op_yyy__1_1__attempt_0, then remoteJobPath is /checkpoints/jobid-xxx.
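
To make the relationship in this example concrete, the sketch below walks up from the ForSt base directory to the job directory. It only illustrates the layout: the PR itself reads the directory from FsCheckpointStorageAccess, and the class and method names here are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative sketch of the directory layout; not the code in this PR. */
public class RemoteJobPathSketch {

    /**
     * Assumed layout: <job dir>/shared/<operator dir>.
     * Returns <job dir>, i.e. two levels above the ForSt base directory.
     */
    public static Path jobPathOf(Path forStBaseDir) {
        Path sharedDir = forStBaseDir.getParent();
        if (sharedDir == null || sharedDir.getParent() == null) {
            throw new IllegalArgumentException("Path is too short: " + forStBaseDir);
        }
        return sharedDir.getParent();
    }

    public static void main(String[] args) {
        Path base = Paths.get("/checkpoints/jobid-xxx/shared/op_yyy__1_1__attempt_0");
        System.out.println(jobPathOf(base));
    }
}
```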

@Nullable Path remoteBasePath) {
this.localJobPath = localJobPath;
this.localBasePath = localBasePath;
this.localForStPath = localBasePath != null ? new Path(localBasePath, DB_DIR_STRING) : null;
Contributor

We have localJobPath, localBasePath, and localForStPath, which is derived from localBasePath.

I was expecting localJobPath to be a subfolder of localBasePath. Is this true? Is this the bean that should validate that?

I see that localBasePath can be null, in which case localForStPath is set to null, but localJobPath can still have a value. What does this mean? The code indicates that it is possible to run without ForSt but still have a localJobPath. Have I understood this correctly?

Contributor Author

Yes, localJobPath is a subfolder of localBasePath.

basePath and jobPath are allowed to be null only to stay consistent with the code before this PR. Currently, the local paths can only be null in unit tests.
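
The null handling discussed here can be sketched as follows. This is a simplified stand-in rather than the actual ForStPathContainer: it covers only the three local paths, uses java.nio paths, and the value "db" for DB_DIR_STRING is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Simplified, hypothetical stand-in for the local half of the path container. */
public class PathContainerSketch {

    /** Assumed directory name; the constant's real value is not shown in this thread. */
    public static final String DB_DIR_STRING = "db";

    public final Path localJobPath;   // may be null (unit tests only, per the discussion)
    public final Path localBasePath;  // may be null (unit tests only, per the discussion)
    public final Path localForStPath; // derived; null whenever localBasePath is null

    public PathContainerSketch(Path localJobPath, Path localBasePath) {
        this.localJobPath = localJobPath;
        this.localBasePath = localBasePath;
        // The DB directory only exists relative to a base path.
        this.localForStPath =
                localBasePath != null ? localBasePath.resolve(DB_DIR_STRING) : null;
    }

    public static void main(String[] args) {
        PathContainerSketch full =
                new PathContainerSketch(Paths.get("/tmp/base/job"), Paths.get("/tmp/base"));
        PathContainerSketch noBase = new PathContainerSketch(Paths.get("/tmp/base/job"), null);
        System.out.println(full.localForStPath);
        System.out.println(noBase.localForStPath);
    }
}
```

Because localForStPath is derived in the constructor, callers only need to null-check the derived path rather than reason about the base path themselves.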


import javax.annotation.Nullable;

/** Container for ForSt paths. */
Contributor

It would be worth describing each path here. The text talks of ForSt paths, but only 2 of the 6 mention ForSt: localForStPath and remoteForStPath.

Contributor Author

Comments have been added to the code.

if (pathContainer.getRemoteForStPath() != null
&& pathContainer.getLocalForStPath() != null) {
if (cacheBasePath == null && pathContainer.getLocalBasePath() != null) {
cacheBasePath = new Path(pathContainer.getLocalBasePath().getPath(), "cache");
Contributor

I assume the log message is incorrect: "not" should be "now".

Contributor Author

The log is correct. It means the cache path is NOT specified in the configuration, so we place it under the local base path.
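
The defaulting rule described in this reply can be sketched in isolation. The helper below is hypothetical and uses java.nio paths; it mirrors only the "configured value wins, otherwise <local base>/cache" behaviour visible in the snippet above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper illustrating the cache path default; not the PR's code. */
public class CachePathSketch {

    /**
     * Returns the configured cache path if there is one, otherwise
     * <localBasePath>/cache, or null when neither is available.
     */
    public static Path resolveCachePath(Path configuredCachePath, Path localBasePath) {
        if (configuredCachePath != null) {
            return configuredCachePath; // explicit configuration always wins
        }
        return localBasePath != null ? localBasePath.resolve("cache") : null;
    }

    public static void main(String[] args) {
        System.out.println(resolveCachePath(null, Paths.get("/tmp/base")));
        System.out.println(resolveCachePath(Paths.get("/mnt/ssd/cache"), Paths.get("/tmp/base")));
    }
}
```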

  public void forceClearRemoteDirectories() throws Exception {
-     if (remoteBasePath != null && remotePathNewlyCreated) {
-         clearDirectories(remoteBasePath);
+     if (pathContainer.getRemoteBasePath() != null && remotePathNewlyCreated) {
Contributor

The forceClear does not seem to force anything. In fact, it only clears the directories in one specific case: when remotePathNewlyCreated is set.

Contributor Author

Yes, that is to fix FLINK-38433.
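
The guard under discussion can be illustrated with a small sketch. The names are hypothetical and the actual deletion is abstracted behind a callback, so this shows only the condition, not how the PR removes files: a directory is cleared only when this instance created it, which leaves a pre-existing directory untouched.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of a guarded cleanup; not the code in this PR. */
public class GuardedCleanupSketch {

    /**
     * Invokes the deleter only if the remote base path exists and was newly
     * created by this instance. Returns true when a deletion was requested.
     */
    public static boolean clearIfNewlyCreated(
            Path remoteBasePath, boolean newlyCreated, Consumer<Path> deleter) {
        if (remoteBasePath == null || !newlyCreated) {
            return false; // pre-existing or absent directory: leave it alone
        }
        deleter.accept(remoteBasePath);
        return true;
    }

    public static void main(String[] args) {
        List<Path> deleted = new ArrayList<>();
        Path remote = Paths.get("/checkpoints/jobid-xxx/shared/op");
        clearIfNewlyCreated(remote, false, deleted::add); // skipped
        clearIfNewlyCreated(remote, true, deleted::add);  // deleter is called
        System.out.println(deleted);
    }
}
```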

-             optionsContainer.getLocalBasePath(),
-             optionsContainer.getRemoteBasePath(),
-             ex);
+     LOG.warn("Could not delete ForSt: {}.", optionsContainer.getPathContainer(), ex);
Contributor

nit: it seems we lose information here. I read the previous log text, Could not delete ForSt local working directory {}, to mean that only the local directory could not be deleted, with the remote directory printed for information. It would be good for the new log message to point to the actual folder that could not be deleted, rather than listing all the folders without indicating which one failed.

Contributor Author

Actually, the folder that could not be deleted should be included in ex, so I think no information is lost here.

RecoveryClaimMode.CLAIM,
CopyDataTransferStrategy.class);

testRestoreStrategyAsExpected(
Contributor

Could this be a parameterized test covering all these permutations?

Contributor Author

Yeah, I did consider that. However, these test cases are already simple enough; parameterizing them would not reduce the amount of code and would only make them harder to read.

  @VisibleForTesting
  Path getLocalBasePath() {
-     return optionsContainer.getLocalBasePath();
+     return optionsContainer.getPathContainer().getLocalBasePath();
Contributor

Is optionsContainer.getPathContainer() ever null?

Contributor Author

No. pathContainer is final and should never be null.

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Oct 7, 2025
Contributor

@Zakelly Zakelly left a comment

LGTM. Thanks for the fix

@Zakelly Zakelly merged commit 78f6e77 into apache:master Oct 14, 2025
AlexYinHan added a commit to AlexYinHan/flink that referenced this pull request Oct 14, 2025
AlexYinHan added a commit to AlexYinHan/flink that referenced this pull request Oct 14, 2025
Labels

community-reviewed PR has been reviewed by the community.

4 participants